First, we get the data. We downloaded it fresh from Google Trends, looking at all 3 together in comparison.
immigration_data <- read_csv('from_google_trends/google_trends_collective.csv', skip=3, col_names=c('month', 'welfare_trend', 'crime_trend', 'report_trend')) |>
mutate(month = parse_date(month, "%Y-%m")) |>
mutate(president=as.factor(ifelse(year(month) <= 2008, "Bush", ifelse(year(month) <= 2016, "Obama", "Trump"))))
## Rows: 192 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): month
## dbl (3): welfare_trend, crime_trend, report_trend
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now, we recreate the plot.
immigration_data |>
pivot_longer(names_to = "trend_type", values_to = "trend_value", cols = ends_with("_trend")) |>
mutate(trend_type=factor(trend_type, labels=c("report_trend", "crime_trend", "welfare_trend"))) |>
ggplot(aes(x=month, y=trend_value, color=president)) +
geom_point(alpha = 0.3) +
geom_smooth(method=lm, formula = y ~ x, se=F) +
facet_wrap(~ trend_type, dir="v")
## Table 3
Now we replicate table 3.
table_3_data <- immigration_data |>
mutate(is_bush = president == "Bush", is_trump = president == "Trump")
models <- list(
crime = lm(crime_trend ~ month + is_bush + is_trump, data = table_3_data),
welfare = lm(welfare_trend ~ month + is_bush + is_trump, data = table_3_data),
report = lm(report_trend ~ month + is_bush + is_trump, data = table_3_data)
)
msummary(models, output='markdown')
| crime | welfare | report | |
|---|---|---|---|
| (Intercept) | -31.109 | -17.690 | 60.338 |
| (13.241) | (10.542) | (16.691) | |
| month | 0.003 | 0.002 | -0.001 |
| (0.001) | (0.001) | (0.001) | |
| is_bushTRUE | 2.836 | 1.673 | -0.748 |
| (2.388) | (1.901) | (3.010) | |
| is_trumpTRUE | 7.348 | 3.798 | 15.219 |
| (2.294) | (1.826) | (2.892) | |
| Num.Obs. | 192 | 192 | 192 |
| R2 | 0.351 | 0.265 | 0.197 |
| R2 Adj. | 0.341 | 0.253 | 0.184 |
| AIC | 1345.8 | 1258.3 | 1434.7 |
| BIC | 1362.1 | 1274.6 | 1451.0 |
| Log.Lik. | -667.905 | -624.134 | -712.370 |
| RMSE | 7.84 | 6.24 | 9.89 |
We are getting different values for the model. Not only are the constants off, but they have welfare increasing by date, while we have it decreasing. They also have big positive changes from Bush. We have small positive changes from Crime and Welfare, and by Reporting, where they have a small positive change, we have a small negative change. And they find bigger changes for Trump than we do.
But, this still supports their point from this table, which was that Trump’s election led to a large positive increase in all these searches, which we do find.
table_3_data |>
add_predictions(models[["report"]]) |>
ggplot(aes(x=month, y=report_trend, color=president)) +
geom_point(alpha = 0.3) +
geom_smooth(method=lm, formula = y ~ x, se=F) +
geom_line(aes(y=pred), linetype="dashed")
table_3_data |>
add_predictions(models[["crime"]]) |>
ggplot(aes(x=month, y=crime_trend, color=president)) +
geom_point(alpha = 0.3) +
geom_smooth(method=lm, formula = y ~ x, se=F) +
geom_line(aes(y=pred), linetype="dashed")
table_3_data |>
add_predictions(models[["welfare"]]) |>
ggplot(aes(x=month, y=welfare_trend, color=president)) +
geom_point(alpha = 0.3) +
geom_smooth(method=lm, formula = y ~ x, se=F) +
geom_line(aes(y=pred), linetype="dashed")
This looks very different from out original plot, but that makes sense.
Before, geom_smooth was using * to have different slopes for the
different presidencies. Since here, we only have a constant based on who
was president, we have one slope for everything.
Now, we get the topic model data. The topic model was given to us by Jake Hofman of Microsoft Research. The model is not included in the repository because it was too big to fit. Instead, we have two matrices from the model, one containing the topics in each document and one containing each term’s topic distribution.
load('from_replication_files/topic_model_lite.RData')
document_topics <- document_topics |>
mutate(date = ymd(date))
First, we make figure 2. It is about immigration coverage in general, not of a specific category, so I think all the documents there count.
campaign_start <- document_topics |>
filter(time == "pre-election") |>
arrange(date) |>
slice_tail(n=1) |>
pull(date)
campaign_end <- document_topics |>
filter(time == "post-election") |>
arrange(date) |>
slice_head(n=1) |>
pull(date)
document_topics |>
mutate(month = floor_date(date, "month")) |>
group_by(month, channel, time) |>
summarize(num_segments = n()) |>
ggplot(aes(x=month, y=num_segments, color=channel)) +
geom_point() +
geom_smooth(aes(group = interaction(channel, time)), se=F) +
geom_vline(aes(xintercept=campaign_start), linetype="dashed") +
geom_vline(aes(xintercept=campaign_end), linetype="dashed") +
scale_color_manual(values=c("magenta", "red", "blue"))
## `summarise()` has grouped output by 'month', 'channel'. You can override using
## the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We have essentially the same results as they do, but with a less severe
jump after the campaign start and a bigger drop before the inauguration.
But their overall findings still show.
document_topics |>
mutate(month = floor_date(date, "month")) |>
group_by(month, channel, time) |>
summarize(proportion_1 = sum(Topic1), proportion_3 = sum(Topic3), crime_coverage = proportion_1 + proportion_3) |>
ggplot(aes(x=month, y=crime_coverage, color=channel)) +
geom_point() +
geom_smooth(aes(group = interaction(channel, time)), se=F) +
geom_vline(aes(xintercept=campaign_start), linetype="dashed") +
geom_vline(aes(xintercept=campaign_end), linetype="dashed") +
scale_color_manual(values=c("magenta", "red", "blue"))
## `summarise()` has grouped output by 'month', 'channel'. You can override using
## the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
document_topics |>
mutate(month = floor_date(date, "month")) |>
group_by(month, channel, time) |>
summarize(welfare_coverage = sum(Topic13)) |>
ggplot(aes(x=month, y=welfare_coverage, color=channel)) +
geom_point() +
geom_smooth(aes(group = interaction(channel, time)), se=F) +
geom_vline(aes(xintercept=campaign_start), linetype="dashed") +
geom_vline(aes(xintercept=campaign_end), linetype="dashed") +
scale_color_manual(values=c("magenta", "red", "blue"))
## `summarise()` has grouped output by 'month', 'channel'. You can override using
## the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This time, we get identical results. I suspect that in the original, we
only differ because of how we deal with the months bifrucated by the
dotted lines. We were consistent, always treating them as multiple
points, causing the lines to cross the dotted lines. They only seem to
have done that for the second one for some reason.
Now we reproduce table 4. This is before we get the daily data, so we do it on a monthly level.
topic_table <- document_topics |>
mutate(month = floor_date(date, "month")) |>
group_by(month, time) |>
summarize(proportion_1 = sum(Topic1), proportion_3 = sum(Topic3), crime_coverage = proportion_1 + proportion_3, welfare_coverage = sum(Topic13), num_segments = n()) |>
mutate(trump_admin = time == "post-election", month_of_year=month(month, label=T)) |>
inner_join(immigration_data, by="month")
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
lm(report_trend ~ num_segments + crime_coverage + welfare_coverage + trump_admin + month + month_of_year, topic_table) |>
msummary(output='markdown')
| (1) | |
|---|---|
| (Intercept) | 83.418 |
| (54.312) | |
| num_segments | 0.000 |
| (0.003) | |
| crime_coverage | 0.084 |
| (0.046) | |
| welfare_coverage | 0.291 |
| (0.145) | |
| trump_adminTRUE | 15.082 |
| (3.821) | |
| month | -0.003 |
| (0.003) | |
| month_of_year.L | -11.362 |
| (3.452) | |
| month_of_year.Q | 9.569 |
| (3.306) | |
| month_of_year.C | 0.599 |
| (3.358) | |
| month_of_year^4 | 0.059 |
| (3.395) | |
| month_of_year^5 | 0.713 |
| (3.442) | |
| month_of_year^6 | -2.255 |
| (3.461) | |
| month_of_year^7 | 3.271 |
| (3.442) | |
| month_of_year^8 | -0.610 |
| (3.400) | |
| month_of_year^9 | -1.815 |
| (3.345) | |
| month_of_year^10 | -3.696 |
| (3.262) | |
| month_of_year^11 | 0.880 |
| (3.241) | |
| Num.Obs. | 71 |
| R2 | 0.715 |
| R2 Adj. | 0.630 |
| AIC | 512.7 |
| BIC | 553.4 |
| Log.Lik. | -238.356 |
| RMSE | 6.95 |
Obviously, this is different from their result, because it is on a monthly level. But we found significantly lower effects of number of crime and welfare coverage, and no effect from the number of segments. These things shouldn’t change, but perhaps the effect only exists in the short-term (but if so, that changes the paper’s conclusions). Though we did still have a high modifier for Trump’s presidency, presumably because that was a long-term effect.
We want to recreate figure 4 and table 3 using the data they used to see if we still get the same results. This data was supplied again by Jake Hofman, but we have included it for anyone trying to replicate this.
report_data <- read_csv('from_replication_files/google_trends_report.csv') |>
rename(report_trend = search)
## Rows: 190 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): month
## dbl (2): year, search
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
crime_data <- read_csv('from_replication_files/google_trends_crime.csv') |>
rename(crime_trend = search)
## Rows: 190 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): month
## dbl (2): year, search
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
welfare_data <- read_csv('from_replication_files/google_trends_welfare.csv') |>
rename(welfare_trend = search)
## Rows: 190 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): month
## dbl (2): year, search
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trend_data_given <- report_data |>
inner_join(crime_data, by=c("year", "month")) |>
inner_join(welfare_data, by=c("year", "month")) |>
mutate(month = paste(year, month), month = parse_date(month, "%Y %m")) |>
mutate(president=as.factor(ifelse(year(month) <= 2008, "Bush", ifelse(year(month) <= 2016, "Obama", "Trump"))))
trend_data_given |>
pivot_longer(names_to = "trend_type", values_to = "trend_value", cols = ends_with("_trend")) |>
mutate(trend_type=factor(trend_type, labels=c("report_trend", "crime_trend", "welfare_trend"))) |>
ggplot(aes(x=month, y=trend_value, color=president)) +
geom_point(alpha = 0.3) +
geom_smooth(method=lm, formula = y ~ x, se=F) +
facet_wrap(~ trend_type, dir="v")
table_3_data_given <- trend_data_given |>
mutate(is_bush = president == "Bush", is_trump = president == "Trump")
list(
crime = lm(crime_trend ~ month + is_bush + is_trump, data = table_3_data_given),
welfare = lm(welfare_trend ~ month + is_bush + is_trump, data = table_3_data_given),
report = lm(report_trend ~ month + is_bush + is_trump, data = table_3_data_given)
) |> msummary(output='markdown')
| crime | welfare | report | |
|---|---|---|---|
| (Intercept) | -4.163 | 28.585 | 72.574 |
| (23.241) | (20.710) | (16.089) | |
| month | 0.001 | 0.000 | -0.002 |
| (0.001) | (0.001) | (0.001) | |
| is_bushTRUE | 8.280 | 7.819 | 0.785 |
| (4.187) | (3.731) | (2.899) | |
| is_trumpTRUE | 19.903 | 18.560 | 19.716 |
| (4.027) | (3.588) | (2.788) | |
| Num.Obs. | 190 | 190 | 190 |
| R2 | 0.262 | 0.237 | 0.286 |
| R2 Adj. | 0.250 | 0.225 | 0.275 |
| AIC | 1544.4 | 1500.6 | 1404.7 |
| BIC | 1560.7 | 1516.8 | 1420.9 |
| Log.Lik. | -767.216 | -745.305 | -697.339 |
| RMSE | 13.72 | 12.23 | 9.50 |
With this, we get the same data as he does. We still don’t get quite the same regression, but we get closer. Interestingly, a lot of times, they have non-zero trends when we have zero trends.
But it’s also true that he downloaded this in 3 separate files while we did it in a single file. Since Google measures things relative to each other when you download them together, it’s possible that this changes things.
Now we will replicate Table 4 using the given day values. Once again, the daily values used by the original paper were provided to us, and you can use it to replicate this.
given_daily_data <- read_csv('from_replication_files/gt_report_daily.csv')
## New names:
## Rows: 5751 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (3): ...1, search, search_adj date (1): date
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
topic_table_daily <- document_topics |>
group_by(date, time) |>
summarize(proportion_1 = sum(Topic1), proportion_3 = sum(Topic3), crime_coverage = proportion_1 + proportion_3, welfare_coverage = sum(Topic13), num_segments = n()) |>
mutate(trump_admin = time == "post-election", month_of_year=month(date, label=T), day_of_week=wday(date, label=T)) |>
inner_join(given_daily_data, by="date")
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
list(
search_model=lm(search ~ num_segments + crime_coverage + welfare_coverage + trump_admin + date + day_of_week + month_of_year, topic_table_daily),
search_adj_model=lm(search_adj ~ num_segments + crime_coverage + welfare_coverage + trump_admin + date + day_of_week + month_of_year, topic_table_daily)
)|>
msummary(output='markdown')
| search_model | search_adj_model | |
|---|---|---|
| (Intercept) | 97.103 | 112.333 |
| (23.412) | (23.691) | |
| num_segments | 0.049 | 0.059 |
| (0.018) | (0.018) | |
| crime_coverage | 0.006 | 0.596 |
| (0.263) | (0.266) | |
| welfare_coverage | 1.319 | 1.895 |
| (0.677) | (0.685) | |
| trump_adminTRUE | 7.750 | 19.614 |
| (1.672) | (1.692) | |
| date | -0.003 | -0.005 |
| (0.001) | (0.001) | |
| day_of_week.L | 1.139 | 1.356 |
| (1.090) | (1.103) | |
| day_of_week.Q | -7.004 | -6.748 |
| (1.094) | (1.107) | |
| day_of_week.C | -0.031 | -0.627 |
| (1.088) | (1.101) | |
| day_of_week^4 | -0.635 | -1.020 |
| (1.087) | (1.100) | |
| day_of_week^5 | 2.848 | 2.523 |
| (1.087) | (1.100) | |
| day_of_week^6 | -1.804 | -1.732 |
| (1.085) | (1.098) | |
| month_of_year.L | 4.413 | -9.821 |
| (1.488) | (1.506) | |
| month_of_year.Q | 1.605 | 8.287 |
| (1.435) | (1.452) | |
| month_of_year.C | 1.615 | -2.743 |
| (1.451) | (1.468) | |
| month_of_year^4 | 4.555 | 2.751 |
| (1.449) | (1.467) | |
| month_of_year^5 | -0.484 | 2.042 |
| (1.447) | (1.464) | |
| month_of_year^6 | -4.023 | -4.824 |
| (1.456) | (1.474) | |
| month_of_year^7 | 3.080 | 1.403 |
| (1.438) | (1.455) | |
| month_of_year^8 | -1.738 | -0.317 |
| (1.432) | (1.449) | |
| month_of_year^9 | 1.755 | -1.039 |
| (1.429) | (1.446) | |
| month_of_year^10 | -3.194 | -0.823 |
| (1.404) | (1.420) | |
| month_of_year^11 | 0.472 | 1.480 |
| (1.392) | (1.409) | |
| Num.Obs. | 2030 | 2030 |
| R2 | 0.089 | 0.266 |
| R2 Adj. | 0.079 | 0.258 |
| AIC | 17633.7 | 17681.8 |
| BIC | 17768.5 | 17816.6 |
| Log.Lik. | -8792.849 | -8816.907 |
| RMSE | 18.40 | 18.62 |
From here, we see that while neither of them are exactly identical, the one using the adjusted values is much closer to what they got. But the non-adjusted version finds a much weaker effect, with crime coverage having an ambiguous effect.
Looking at the data, it seems that the problem is that they can only look at one time period (say, month) at a time, and each month is only scaled with other searches in that month. But, I don’t understand how their normalization works.
We noticed that they got each prompt separately. Since Google Trends weights everything you are comparing to the max of the other, if we want to replicate their results, then we need to get them separately like what the researchers seem to have done. We downloaded the immigrant report, crime, and welfare trends separately.
report_data <- read_csv('from_google_trends/report_trend_individual.csv', skip=3, col_names=c('month', 'report_trend'))
## Rows: 192 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): month
## dbl (1): report_trend
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
crime_data <- read_csv('from_google_trends/crime_trend_individual.csv', skip=3, col_names=c('month', 'crime_trend'))
## Rows: 192 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): month
## dbl (1): crime_trend
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
welfare_data <- read_csv('from_google_trends/welfare_trend_individual.csv', skip=3, col_names=c('month', 'welfare_trend'))
## Rows: 192 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): month
## dbl (1): welfare_trend
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trend_data_individual <- report_data |>
inner_join(crime_data, by=c("month")) |>
inner_join(welfare_data, by=c("month")) |>
mutate(month = parse_date(month, "%Y-%m")) |>
mutate(president=as.factor(ifelse(year(month) <= 2008, "Bush", ifelse(year(month) <= 2016, "Obama", "Trump"))))
trend_data_individual |>
pivot_longer(names_to = "trend_type", values_to = "trend_value", cols = ends_with("_trend")) |>
mutate(trend_type=factor(trend_type, labels=c("report_trend", "crime_trend", "welfare_trend"))) |>
ggplot(aes(x=month, y=trend_value, color=president)) +
geom_point(alpha = 0.3) +
geom_smooth(method=lm, formula = y ~ x, se=F) +
facet_wrap(~ trend_type, dir="v")
table_3_data_individual <- trend_data_individual |>
mutate(is_bush = president == "Bush", is_trump = president == "Trump")
list(
crime = lm(crime_trend ~ month + is_bush + is_trump, data = table_3_data_individual),
welfare = lm(welfare_trend ~ month + is_bush + is_trump, data = table_3_data_individual),
report = lm(report_trend ~ month + is_bush + is_trump, data = table_3_data_individual)
) |> msummary(output='markdown')
| crime | welfare | report | |
|---|---|---|---|
| (Intercept) | -57.817 | -20.711 | 68.858 |
| (23.105) | (27.392) | (17.014) | |
| month | 0.005 | 0.003 | -0.002 |
| (0.001) | (0.002) | (0.001) | |
| is_bushTRUE | 6.951 | 4.410 | -1.618 |
| (4.166) | (4.939) | (3.068) | |
| is_trumpTRUE | 12.422 | 12.235 | 16.077 |
| (4.003) | (4.746) | (2.948) | |
| Num.Obs. | 192 | 192 | 192 |
| R2 | 0.339 | 0.190 | 0.190 |
| R2 Adj. | 0.329 | 0.177 | 0.178 |
| AIC | 1559.6 | 1625.0 | 1442.1 |
| BIC | 1575.9 | 1641.2 | 1458.4 |
| Log.Lik. | -774.798 | -807.478 | -716.049 |
| RMSE | 13.69 | 16.23 | 10.08 |
But the paper still has welfare searches going up in Trump’s presidency, while we have it going down slightly. And in the table, they have welfare going down over time, while we have it going up. So, this discrepency doesn’t explain anything.
The paper says: “We also expect that when the Trump administration receives media coverage for anti-immigrant rhetoric or policies, we will see an uptick in reporting searches. However, we do not expect a discontinuity in reporting searches immediately after the 2016 election, nor do we expect media coverage of candidate Trump’s anti-immigrant rhetoric to generate more reporting searches. Neither candidate Trump nor president-elect Trump had the power to change immigration policy before his inauguration, so his anti-immigrant positions should not increase interest in reporting” (5-6). Smriti and I are skeptical of this claim, which the paper doesn’t seem to actually prove, so we are going to investigate it and see if we find any discontinuities, and whether coverage of Candidate Trump also increases the amount of reporting.
So, we will make a regression, where we have trump and candidacy stage as * along with month see if that boosts it.
trump_extension_table <- document_topics |>
mutate(month = floor_date(date, "month")) |>
group_by(month, time, trump) |>
summarize(proportion_1 = sum(Topic1), proportion_3 = sum(Topic3), crime_coverage = proportion_1 + proportion_3, welfare_coverage = sum(Topic13), num_segments = n()) |>
mutate(month_of_year=month(month, label=T)) |>
inner_join(trend_data_individual, by="month")
## `summarise()` has grouped output by 'month', 'time'. You can override using the
## `.groups` argument.
list(
lm(report_trend ~ num_segments + crime_coverage + welfare_coverage + time * trump + month + month_of_year, trump_extension_table),
lm(report_trend ~ num_segments + crime_coverage + welfare_coverage + time * trump + month, trump_extension_table)
) |>
msummary(output='markdown')
| (1) | (2) | |
|---|---|---|
| (Intercept) | 38.257 | 77.100 |
| (54.189) | (60.811) | |
| num_segments | 0.003 | 0.008 |
| (0.004) | (0.004) | |
| crime_coverage | 0.134 | 0.077 |
| (0.061) | (0.065) | |
| welfare_coverage | 0.232 | 0.159 |
| (0.181) | (0.200) | |
| timeelection | 0.493 | 0.897 |
| (3.581) | (4.027) | |
| timepost-election | 12.474 | 15.023 |
| (5.191) | (5.857) | |
| trump | 7.010 | 9.438 |
| (3.952) | (4.352) | |
| month | 0.000 | -0.003 |
| (0.003) | (0.004) | |
| month_of_year.L | -11.771 | |
| (2.576) | ||
| month_of_year.Q | 9.783 | |
| (2.486) | ||
| month_of_year.C | 0.025 | |
| (2.549) | ||
| month_of_year^4 | 1.969 | |
| (2.533) | ||
| month_of_year^5 | 0.136 | |
| (2.567) | ||
| month_of_year^6 | -4.019 | |
| (2.552) | ||
| month_of_year^7 | 3.259 | |
| (2.576) | ||
| month_of_year^8 | -0.975 | |
| (2.551) | ||
| month_of_year^9 | -1.938 | |
| (2.508) | ||
| month_of_year^10 | -3.669 | |
| (2.491) | ||
| month_of_year^11 | 1.678 | |
| (2.453) | ||
| timeelection × trump | -3.588 | -6.538 |
| (4.615) | (5.165) | |
| timepost-election × trump | 1.081 | -1.592 |
| (4.270) | (4.785) | |
| Num.Obs. | 131 | 131 |
| R2 | 0.679 | 0.540 |
| R2 Adj. | 0.621 | 0.506 |
| AIC | 941.5 | 966.6 |
| BIC | 1004.7 | 998.2 |
| Log.Lik. | -448.730 | -472.284 |
| RMSE | 7.44 | 8.90 |
If we regress by month in addition to our other variables, then we get what the paper predicts: trump’s name causes a bigger boost post-election than the campaign. However, since the Trump boost (3.374) is more than 3, and the decrease in the campaign is less than that (-2.447), Trump still has a positive association with report searches, even in the campaign. When we stop filtering by month, this remains the case.
However, the standard error is so large that findings here may be meaningless.
Now, we check if we can find a discontinuity in reporting search rates immediately after the election in the same way that the paper found one after inauguration. We look at the discontinuity at those two locations as well as two arbitrary locations - September 1st before the election and April 1st afterwards - to see if we can find similar discontinuities.
Since the data available to us isn’t available in high-enough precision for this, we use their adjusted daily search data for more precise measurements.
discontinuity_data <- given_daily_data |>
mutate(arbitrary_before = date > ymd("2016-09-01"), after_election = date > ymd("2016-11-8"), after_inauguration = date > ymd("2017-01-20"), arbitrary_after = date > ymd("2017-04-01")) |>
filter(date >= ymd("2014-01-01"))
discontinuity_data |>
pivot_longer(c(arbitrary_before, after_election, after_inauguration, arbitrary_after), names_to = "split_location", values_to = "split_values") |>
ggplot(aes(x=date, y=search_adj, color=split_values)) +
geom_point(alpha = 0.05) +
geom_smooth(method=lm, formula = y ~ x, se=F) +
facet_wrap(~ split_location)
So, showing a bunch of splits, we can see that while regressing separately between before and after the election showed a large discontinuity, so did regressing before and after the election, which the paper claimed wouldn’t have such a discontinuity. When we use the arbitrary splits, the one afterwards shows almost no discontinuity, but the split beforehand does show one, albeit not as large. Because of this, I don’t think that showing a discontinuity in linear regressions has that precise of an effect and question its use in the paper.
We also wanted to see it as a regression so we could have a numerical measure of the discontinuity.
list(
lm(search_adj ~ date + arbitrary_before, discontinuity_data),
lm(search_adj ~ date + after_election, discontinuity_data),
lm(search_adj ~ date + after_inauguration, discontinuity_data),
lm(search_adj ~ date + arbitrary_after, discontinuity_data)
) |>
msummary(output="markdown")
| (1) | (2) | (3) | (4) | |
|---|---|---|---|---|
| (Intercept) | 71.010 | 141.576 | 155.995 | -81.162 |
| (22.202) | (22.172) | (22.193) | (22.955) | |
| date | -0.002 | -0.006 | -0.007 | 0.007 |
| (0.001) | (0.001) | (0.001) | (0.001) | |
| arbitrary_beforeTRUE | 16.113 | |||
| (1.707) | ||||
| after_electionTRUE | 22.202 | |||
| (1.692) | ||||
| after_inaugurationTRUE | 23.374 | |||
| (1.689) | ||||
| arbitrary_afterTRUE | 2.312 | |||
| (1.749) | ||||
| Num.Obs. | 2186 | 2186 | 2186 | 2186 |
| R2 | 0.107 | 0.139 | 0.146 | 0.072 |
| R2 Adj. | 0.106 | 0.138 | 0.145 | 0.071 |
| AIC | 19343.0 | 19264.6 | 19246.6 | 19428.8 |
| BIC | 19365.8 | 19287.4 | 19269.4 | 19451.5 |
| Log.Lik. | -9667.506 | -9628.312 | -9619.319 | -9710.376 |
| RMSE | 20.16 | 19.80 | 19.72 | 20.56 |
Here, we see that while the largest discontinuity is, in fact, after the inauguration, at 23.374, the election split has almost as big of a modifier with 22.202 - in fact, since it’s within the margin of error, we can’t really say which one is actually better. While the earlier arbitrary time has a smaller discontinuity, at 16.113, it is still large.
The discontinuity does show an increase of searches for reporting immigrants during Trump’s term. But the presence of a discontinuity does not mean that the exact location chose to split the data means anything.